In [1]:
from IPython.display import Image, HTML, display, YouTubeVideo  # IPython rich display utilities.
# Suppress deprecation warnings.
import warnings
warnings.filterwarnings('ignore')

BIG DATA - Data Analysis


Notebook created by Raul E. Lopez Briega

relopezbriega@gmail.com

relopezbriega.com.ar

Solution Title: Data Pythonisa

Group: dotCOM

  • Captain: Raul Lopez Briega
  • Member 2: Stephanie Anglarill
  • Member 3: Daniel Garac y Gojac

In [10]:
Image('/home/raul/Ga_Tech/dotCom/dotComLOGO.png')


Out[10]:

Introduction

Data analysis is now critical to business strategy. Businesses are increasingly driven by data analytics, so there is great professional advantage in being able to interact with the vast amounts of data in today's world. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, will not only allow one to interact competently, but will also help one to envision opportunities for improving data-driven decision-making and to see data-oriented competitive threats.

For all the above, our goals in this challenge are not only to help the NGO find the best strategy for its campaign, but also to build a reusable framework for the whole data analytics process.

Building our Framework

We wanted a general-purpose, easy-to-use framework, so we decided that its key properties had to be:

  • Interactivity: our framework must allow interaction with the end user; we did not want to create a batch process, but rather something where you can input your questions and receive an immediate answer.

  • Visualization: when performing data analysis, the ability to visualize results is critical, so we wanted a framework that allows easy and clear visualizations.

  • Ease of use: for us, the learning curve was an important factor; we wanted something easy to understand and learn, so we could start working with it right away.

  • Reporting: the results of any analysis are worth nothing if no one reads them, so our framework should give us easy-to-use reporting and sharing capabilities.

  • Extensibility: we did not want a domain-specific framework that only allows one kind of analysis; we wanted something that lets us extend the framework for other purposes and gives us the freedom to choose between different tools.

After a careful analysis of the different options, we decided to use Python with its extension modules IPython, pandas, matplotlib, NumPy and scikit-learn.

Framework key components

The key components of the framework we propose are:

IPython Notebook

This is the kernel of the framework. The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document:


In [13]:
Image('/home/raul/Pictures/IPython.png', width=500, height=500)


Out[13]:

IPython Notebook

IPython Notebooks are normal files that can be shared with colleagues and converted to other formats such as HTML, PDF, or even slide shows like this one. Here is a short demo of the notebook’s basic features by the Pybonacci team:


In [2]:
HTML("""<iframe width="500" height="425"
src="http://www.youtube.com/embed/H6dLGQw9yFQ">
</iframe>""")


Out[2]:

Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas is well suited for the following kinds of data (a short sketch follows the list):

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.
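
To make this concrete, here is a minimal sketch of a labeled, heterogeneously-typed table in pandas; the donor names and values are made up for illustration:

In [ ]:
import pandas as pd

# A small DataFrame with labeled, heterogeneously-typed columns.
df = pd.DataFrame({
    'donor': ['Ana', 'Ben', 'Cleo'],
    'state': ['CA', 'FL', 'MI'],
    'last_gift': [25.0, 10.0, 50.0],
    'responded': [True, False, True],
})
print(df.dtypes)      # each column keeps its own dtype
print(df.describe())  # summary statistics for the numeric columns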

Pandas

Key features (two of them are sketched in code after this list):

  • Easy handling of missing data.
  • Size mutability: columns can be inserted and deleted from DataFrames and higher-dimensional objects.
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically.
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets.
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
  • Intuitive merging and joining data sets.
  • Flexible reshaping and pivoting of data sets.
  • Hierarchical labeling of axes.
  • Robust I/O tools for loading data from flat files, Excel files, databases, and HDF5.
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
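
As a quick illustration of two of these features, missing-data handling and group-by, consider this sketch (the data is made up):

In [ ]:
import numpy as np
import pandas as pd

# Missing-data handling and split-apply-combine on a toy table.
df = pd.DataFrame({
    'state': ['CA', 'CA', 'FL', 'FL'],
    'gift': [25.0, np.nan, 10.0, 50.0],
})
df['gift'] = df['gift'].fillna(0)          # easy handling of missing data
print(df.groupby('state')['gift'].mean())  # group by state and average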

Matplotlib

Matplotlib is the most popular Python library for producing plots and other 2D data visualizations. It integrates well with IPython, thus providing a comfortable interactive environment for plotting and exploring data. The plots are also interactive; you can zoom in on a section of the plot and pan around using the toolbar in the plot window.

Some of the many advantages of this library include (a minimal example follows the list):

  • Easy to get started.
  • Support for LaTeX-formatted labels and texts.
  • Great control of every element in a figure, including figure size and DPI.
  • High-quality output in many formats, including PNG, PDF, SVG, EPS.
  • GUI for interactively exploring figures and support for headless generation of figure files (useful for batch jobs).
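
Here is a minimal sketch that touches several of these points: figure size, LaTeX labels and high-quality PNG output (the file name and DPI are arbitrary choices):

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# A simple sine plot with explicit figure size and a LaTeX label.
x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label=r'$\sin(x)$')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
fig.savefig('sine.png', dpi=150)  # high-quality PNG output
plt.show()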

Matplotlib

Here are some examples of the graphics we can create using Matplotlib:


In [23]:
HTML("<iframe src=http://matplotlib.org/gallery.html#lines_bars_and_markers width=800 height=350></iframe>")


Out[23]:

Scikit-learn

Scikit-learn is an open source machine learning library for the Python programming language.

Some of the machine learning problems we can handle with scikit-learn are listed below (a toy example follows the list):

  • Classification: Identifying to which set of categories a new observation belongs.
  • Regression: Predicting a continuous value for a new example.
  • Clustering: Automatic grouping of similar objects into sets.
  • Dimensionality reduction: Reducing the number of random variables to consider.
  • Model selection: Comparing, validating and choosing parameters and models.
  • Preprocessing: Feature extraction and normalization.
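
As a toy classification example, here is a sketch that fits a k-nearest-neighbors classifier on the iris dataset bundled with scikit-learn (the classifier choice and the train/test split are illustrative; current scikit-learn APIs are assumed):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a simple classifier and measure accuracy on unseen data.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))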

Scikit-learn

Here are some examples of what scikit-learn can do:


In [24]:
HTML("<iframe src=http://scikit-learn.org/stable/auto_examples/index.html width=800 height=350></iframe>")


Out[24]:

Finding the best strategy for the NGO campaign

Now it is time to describe the data mining process we followed to solve the NGO campaign challenge.

The phases we followed are:

  • Data Cleansing

  • Data Exploration

  • Building the prediction model

Data Cleansing

The first step in any good knowledge discovery process is to clean the dataset. This is an important process because incorrect or inconsistent data can lead to false conclusions and misdirected actions. Our principal goals in this phase were to fill in the missing values, to detect the outliers and to remove the unnecessary information.

To accomplish these goals we used some of the built-in functions that the pandas module offers: we removed the columns that were not statistically significant, created some more descriptive new columns, and identified some of the most important outliers. A minimal sketch of these steps follows.
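
The sketch below assumes a hypothetical donors.csv file; the dropped columns (NOEXCH, GEOCODE), the donation-count column NGIFTALL and the derived column AVG_GIFT are illustrative names, not the exact fields we used:

In [ ]:
import pandas as pd

# Load the raw dataset (hypothetical file name).
df = pd.read_csv('donors.csv')

# Drop columns that carry no statistical signal (illustrative names).
df = df.drop(['NOEXCH', 'GEOCODE'], axis=1)

# Fill missing ages with the column median.
df['AGE'] = df['AGE'].fillna(df['AGE'].median())

# Derive a more descriptive column: average gift per past donation.
df['AVG_GIFT'] = df['RAMNTALL'] / df['NGIFTALL']

# Flag donors whose total giving is more than 3 standard deviations
# from the mean as outlier candidates.
total = df['RAMNTALL']
outliers = df[(total - total.mean()).abs() > 3 * total.std()]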

Data Exploration

Once our dataset was clean, we could continue with the next step, the exploratory phase; here our focus was on detecting the key factors and fields that would give us a way to predict the donation behavior.

This phase is quite important because the only way to develop intuition for what is going on in an unfamiliar dataset is to immerse yourself in it.

In this phase we made extensive use of visualizations; our goal was to get to know the data, so we examined data distributions, validated assumptions and asked a lot of questions. A minimal sketch of this kind of exploration follows.
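
Continuing from the cleansed DataFrame df above, a typical exploratory step is to plot the distribution of a key field; the bin count and figure size here are arbitrary choices:

In [ ]:
import matplotlib.pyplot as plt

# Distribution of total past donations per donor.
df['RAMNTALL'].hist(bins=50, figsize=(8, 4))
plt.xlabel('Total past donations ($)')
plt.ylabel('Number of donors')
plt.title('Distribution of donor giving history')
plt.show()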

Some of the insights and understandings we gained during this phase were:

  1. The most significant variables for predicting a customer’s donation behavior are the previous donation behavior summaries.
  2. The demographic data turns out to be quite strongly connected to the donation performance of the population.
  3. Identifying donors is a thoroughly different task than maximizing donation. There is an inverse correlation between likelihood to respond and the dollar amount of the gift.

Building the prediction model

With all the information and knowledge we gained from the exploratory phase, we were ready to start building a model to test our assumptions and try to predict the donor’s behavior.

We first started with a single model; to build it, we created 7 segments from the different insights we gained from the exploratory data analysis (a pandas sketch of these filters follows the list). These segments are:

  1. MAXRAMNT > 30
  2. RAMNTALL > 250
  3. HV2 > 1600 and AGE between 30 and 60.
  4. EC8 > 12
  5. IC4 > 800
  6. RAMNT_3 > 3.5
  7. STATE in ('CA', 'FL', 'MI')
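
A minimal sketch of these segments as pandas boolean masks over the cleaned DataFrame df; combining them with OR into a single targeting rule is an illustrative assumption:

In [ ]:
# Each segment as a boolean mask over the cleaned DataFrame df.
seg1 = df['MAXRAMNT'] > 30
seg2 = df['RAMNTALL'] > 250
seg3 = (df['HV2'] > 1600) & df['AGE'].between(30, 60)
seg4 = df['EC8'] > 12
seg5 = df['IC4'] > 800
seg6 = df['RAMNT_3'] > 3.5
seg7 = df['STATE'].isin(['CA', 'FL', 'MI'])

# Target every donor that falls into at least one segment.
targeted = df[seg1 | seg2 | seg3 | seg4 | seg5 | seg6 | seg7]
print(len(targeted), 'donors selected')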

Applying this single model to the dataset, we got a profit improvement of 50%.

Building the prediction model

We got great results with the single model, but we did not stop there: we then tried to build a more complex model using the random forest machine learning algorithm to predict the results. We used almost the same variables as feature selections for the algorithm; with this new model, we got a profit improvement of 650%. A sketch of this kind of model follows.
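
The sketch below assumes current scikit-learn APIs, a 0/1 response label column TARGET_B, and the segment variables above as features; the hyperparameters are illustrative choices, not our tuned values:

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features drawn from the segment analysis (illustrative subset);
# TARGET_B is an assumed 0/1 response indicator.
features = ['MAXRAMNT', 'RAMNTALL', 'HV2', 'AGE', 'EC8', 'IC4', 'RAMNT_3']
X = df[features]
y = df['TARGET_B']

# Hold out a test set to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# n_estimators here is an illustrative choice.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))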

The report with all our analysis can be found here.


In [5]:
Image('/home/raul/Pictures/IPython.png', width=500, height=500)


Out[5]: